The dataset contains 110,527 medical appointments of patients in various neighbourhoods in Brazil. It contains 14 observations that contain patient data. The last observation records whether or not, the patient attended the appointment.
The observations are as follows:
#Install missing packages using pip
#Restart Kernel afterwards
!pip install geopy;
!pip install --upgrade plotly;
!pip install --upgrade pyopenssl;
!pip install --upgrade certifi;
!pip install --upgrade pandas==1.1.5;
!pip install --upgrade tensorflow-tensorboard==0.1.1;
!pip --disable-pip-version-check install requests;
import warnings
warnings.simplefilter(action = 'ignore', category=FutureWarning) #removes warnings about future deprecations in libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import geopy
import numpy as np
import plotly
%matplotlib inline
print("Pandas version " + pd.__version__) #requires minimum pandas 1.1.5
print("Geopy version " + geopy.__version__)
print("Plotly version " + plotly.__version__)
dataset = ('./Database_No_show_appointments/noshowappointments-kagglev2-may-2016.csv')
df = pd.read_csv(dataset)
df.head()
df.info()
datatype = df.dtypes #checks the datatype of the column
sum_na = df.isna().sum() #the sum of any missing values in each column
na_ = df.isna().any() #checks for any missing values
info = pd.concat([na_,sum_na,datatype], axis=1, keys=['na_', 'sum_na','datatype'])
print(info)
duplicates = df.duplicated().sum()
print("There are {} duplicate records in the dataset ".format(duplicates))
for feature in df.columns:
print('{} has total {} categories \n'
.format(feature,len(df[feature].value_counts())))
features_numeric = ['Age','Scholarship','Hipertension','Diabetes','Alcoholism','Handcap','SMS_received']
df[features_numeric].hist(figsize=(15,20));
df['No-show'].value_counts().plot(kind="bar",
xlabel='No show',
ylabel='Total Count',
title='No Show Distribution',figsize=(8,5));
df['Age'].hist(figsize=(10,8), bins=15);
df['Age'].unique()
df.query('Age <0')
df.query('(Age == 0) & (Alcoholism == 1)')
df['Handcap'].unique()
df['Handcap'].value_counts()
The following function will convert any value above 1 to 1, and 0 will remain 0
def correct_handcap(x):
if x >= 1:
return 1
else:
if x == 0:
return 0
Apply the correct_handcap() function
df['Handcap'] = df['Handcap'].apply(correct_handcap)
df['Handcap'].value_counts()
df['ScheduledDay'] = pd.to_datetime(df['ScheduledDay'])
df['AppointmentDay'] = pd.to_datetime(df['AppointmentDay'])
df['App_weekday'] = df['AppointmentDay'].dt.dayofweek
df['Part_of _week'] = df['App_weekday'].apply(lambda x:'weekend' if x >= 5 else 'weekday')
df['App_dayname'] = df["AppointmentDay"].dt.day_name()
This has created a column containing names of the days of the week, Sunday to Monday.
df['Schdl_hour'] = df['ScheduledDay'].dt.hour
A column containing the hour extracted from the time and date from the 'ScheduledDay' column has ben created.
The following function will categorise the different hours of the day according to the time
def get_duration (t):
if t >= 4 and t <= 9:
return 'Early morning'
elif t > 9 and t <= 12:
return ' Morning'
elif t > 12 and t <=13:
return 'Noon'
elif t > 13 and t <= 16:
return 'Afternoon'
elif t > 16 and t <=20:
return 'Evening'
elif t > 20 and t <= 24:
return 'Night'
elif t < 4 :
return 'Late Night'
df["Part_of_day"] = df['Schdl_hour'].apply(get_duration)
This function uses the hour value derived from the Scheduled Appointmnet time and contained in the Schdl_hour column to classify the time into;
Early Morning - (04:01 to 09:00)
Morning- (09:01 to 12:00)
Noon- (12:00 to 13:00)
Afternoon- (13:01 to 16:00)
Evening - (16:01 to 20:00)
Night - (20:01 to 00:00)
Late Night - (00:01 to 04:00)
This is achieved by obtaining the difference in the number of days between the 'AppointmentDay' and 'ScheduledDay' and passing the values into the 'waiting_days' column.
df['waiting_days'] = (df['AppointmentDay'].dt.date) - (df['ScheduledDay'].dt.date)
df['waiting_days'] = df['waiting_days'].dt.days
df['waiting_days'].describe()
df.query('waiting_days < 0')
df.query('waiting_days < 0').iloc[0,0]
Using the specific Patient Id check for other appointments and their corresponding data.
df.query('PatientId == 7839272661752.0')
Only the appointment with the negative 'waiting_day' difference shows a missed attendance. This is the same for all the records showing a negative difference.
This 5 records will be dropped from the dataset.
Use the same condition to drop the 5 records from teh dataset.
df = df[df.waiting_days >= 0]
Query the dataset to confirm the rows have been dropped.
negative_difference = len(df.query('waiting_days < 0'))
print("There are {} records that display a negative value in the 'waiting_days' column.".format(negative_difference))
df['Attendance'] = df['No-show'].apply(lambda x: 'Attended' if x == 'No' else 'Missed');
df['No-show'].value_counts()
df['Attendance'].value_counts()
df['No-show'] = df['No-show'].apply(lambda x: 0 if x == 'No' else 1);
Converted the 'No-show' column so as 1 denotes 'Yes' and 0 denotes 'No'
Obtain a list of all the neighbourhoods in the dataset
neighbourhood_list = df['Neighbourhood'].unique()
neighbourhood_list
country = ', BRAZIL'
with_country = [neighbourhood + country for neighbourhood in neighbourhood_list]
!conda config --set ssl_verify False
%%time
from geopy.geocoders import Nominatim
locator = Nominatim(user_agent="kenneth_API") ##
geopy.geocoders.options.default_timeout = 1000
LAT = []
LONG = []
for x in with_country:
geolocation = locator.geocode(x, timeout=1000)
if geolocation is None: #checks whether the name matches to a location
LAT.append(None) #appends a nullvalue when location not foundd
LONG.append(None)
else:
latitude = geolocation.latitude #appends geodata to list
longitude = geolocation.longitude
LAT.append(latitude)
LONG.append(longitude);
df_location_data = pd.DataFrame(list(zip(neighbourhood_list, LAT, LONG)), columns=['Neighbourhood', 'Latitude','Longitude'])
df_location_data.info()
df_location_data['Latitude'].isnull()
df_location_data.tail(10)
df_location_data = df_location_data[df_location_data.Neighbourhood != 'ILHAS OCEÂNICAS DE TRINDADE']
df_location_data = df_location_data.append(
{'Neighbourhood': ' ILHA DE TRINDADE','Latitude':-20.524892,'Longitude': -29.324559}, ignore_index=True)
df_location_data.tail(10)
df.query('Neighbourhood == "ILHAS OCEÂNICAS DE TRINDADE"') #check records with that value
df.loc[48754, "Neighbourhood"] = "ILHA DE TRINIDADE" #get indexes and use loc method to change
df.loc[48765, "Neighbourhood"] = "ILHA DE TRINIDADE"
df.loc[48754]
df.query('Neighbourhood == "ILHA DE TRINIDADE"') #confirm the neighbourhood value is changed
This will display 0 if the patient lacks all four observations (Hipertension, Diabetes, Alcoholism and Handcap) and 1 if the have any of the four observations
df['Pre_existing_conditions'] = df['Hipertension'] + df['Diabetes']+ df['Alcoholism']+ df['Handcap']
df['Pre_existing_conditions'] = df['Pre_existing_conditions'].apply(lambda x:1 if x >= 1 else 0)
df['Pre_existing_conditions'].value_counts()
df.head()
cols = df.columns.tolist()
print(cols)
df = df[['PatientId', 'AppointmentID', 'Gender', 'ScheduledDay',
'AppointmentDay','App_weekday','Part_of _week',
'App_dayname', 'Schdl_hour', 'Part_of_day', 'waiting_days', 'Age', 'Neighbourhood', 'Scholarship',
'Hipertension', 'Diabetes', 'Alcoholism', 'Handcap','Pre_existing_conditions',
'SMS_received', 'No-show', 'Attendance']]
import plotly.express as px
fig = px.line_geo(lat=[0,15,20,35], lon=[5,10,25,30])
fig = px.scatter_mapbox(
df_location_data, # Our DataFrame
lat="Latitude",
lon="Longitude",
center={"lat":-4.047995, "lon":-40.864349},
width=1000, # Width of map
height=1000, # Height of map
hover_data=["Neighbourhood"])
fig.update_layout(mapbox_style="open-street-map")
fig.show()
df.head()
Check the distribution of overall attendance
Attended = df.Attendance == 'Attended' #create masks
Missed = df.Attendance == 'Missed'
Get the number of unique Patient Id's and Appointment Id's
df['PatientId'].nunique() #unique patient Id's
df['AppointmentID'].nunique() #unique number of Appointment Id's
df.Attendance.value_counts()
plt.rc('font', size=15)
plt.rcParams['figure.figsize'] = [25, 7]
plt.bar(x=(df.Attendance.value_counts().index),height=df.Attendance.value_counts(), color='#c3d5e8')
plt.title('Appointment Attendance')
plt.box(False)
plt.grid(True)
plt.legend()
plt.show()
attendance_by_percentage=(df.Attendance.value_counts()/df.Attendance.value_counts().sum())*100
attendance_by_percentage
fig = plt.figure(figsize =(8, 8),tight_layout=False)
plt.pie(attendance_by_percentage, labels =('Attended','Missed'), colors=('turquoise','lime'),autopct='%1.1f%%');
plt.title('Attendance %')
plt.legend()
plt.show()
Get the number of Appointments by Gender
df.Gender.value_counts()
print("There are {} appointments attributed to female patients and {} appointments attributed to male patients"
.format(df.Gender.value_counts().values[0],
df.Gender.value_counts().values[1]))
Plot this distribution
fig = plt.figure(figsize =(10, 8),tight_layout=False)
plt.pie(df.Gender.value_counts(), labels =('Female','Male'), colors=('teal','purple'),autopct='%1.1f%%');
plt.title('Total Gender Distribution %')
plt.legend()
plt.show()
df.Gender[Attended].value_counts() #attended appointments by gender
print("Female patients attended {} while male patients attended {} of {} appointments"
.format(df.Gender[Attended].value_counts().values[0],
df.Gender[Attended].value_counts().values[1],
df.AppointmentID[Attended].value_counts().sum()))
fig = plt.figure(figsize =(10, 8),tight_layout=False)
plt.pie(df.Gender[Attended].value_counts(), labels =('Female','Male'), colors=('teal','purple'),autopct='%1.1f%%');
plt.title('Attended Appointments %')
plt.show()
Check whether pre-existing conditions contribute to the higher number of female patients
with_pre_existing = df['Pre_existing_conditions'] == 1 #create mask
df.Age[with_pre_existing]
by_male = df.Gender == 'M' #create mask
by_female = df.Gender == 'F'
df.Pre_existing_conditions.value_counts() #value counts for Pre-existing conditions
print("There are {} patients with one or more pre-existing conditions while {} patients lack any pre-existing conditions"
.format(df.Pre_existing_conditions.value_counts().values[1],df.Pre_existing_conditions.value_counts().values[0]))
fig, ax = plt.subplots(1,2, figsize=(20,8))
ax[0].scatter(x=df.Age[by_female],
y=(df['Diabetes']+df['Hipertension']+df['Alcoholism']+df['Handcap'])[by_female],
alpha=0.8, color='teal', label='Female')
ax[0].set_xlabel("Age")
ax[0].set_ylabel("Number of Pre-existing conditions")
ax[0].legend()
ax[0].set_title("Distribution of Pre-existing Between Female and Male by Age")
ax[1].set_title("Distribution of Pre-existing Between Female and Male by Age")
ax[1].scatter(x=df.Age[by_male],
y=(df['Diabetes']+df['Hipertension']+df['Alcoholism']+df['Handcap'])[by_male],
alpha=0.5, color='magenta',label='Male')
ax[1].set_xlabel("Age")
ax[1].set_ylabel("Number of Pre-existing conditions")
ax[1].legend()
plt.show()
Check to see if there is a relationship between Age and Gender
np.mean(df.Age[by_male]), np.mean(df.Age[by_female]) #mean ages by gender
df.Age[by_male].describe() #descriptive statistics of age among men
df.Age[by_female].describe() #descriptive statistics of age among women
Plot distribution of age based on gender
plt.rc('font', size=15)
fig = plt.figure(figsize =(10, 8))
plt.hist(df.Age[by_male], bins=10, alpha=0.5, label='Male', color='purple',edgecolor='black')
plt.hist(df.Age[by_female], bins=10, alpha = 0.5, label='Female', color='teal',edgecolor='black')
plt.axvline(df.Age[by_male].mean(), color='y', linestyle='dashed', linewidth=5, label='Male mean')
plt.axvline(df.Age[by_female].mean(), color='r', linestyle='dashed', linewidth=5, label='Female mean')
plt.axvline(df.Age[by_male].median(), color='y', linestyle='dashdot', linewidth=1, label='Male median')
plt.axvline(df.Age[by_female].median(), color='r', linestyle='dashdot', linewidth=1, label='Female median')
x_ticks = [0,10,20,30,40,50,60,70,80,90,100,110,120]
plt.title('Age Distribution By Gender')
plt.legend()
plt.show();
from scipy.stats import skew
print("The skewness for Age of male patients is {}, and female patients is {}"
.format(skew(df.Age[by_male]),
skew(df.Age[by_female])))
df.Gender[Missed].value_counts() #missed appointmnets by gender
Plot bar graph showing distribution of missed appointments by gender
plt.figure(figsize=(25,10))
plt.bar(x=(df.Gender[Missed].value_counts().index),height=df.Gender[Missed].value_counts(), color='teal')
plt.title('Distribution of Missed Appointments By Gender', fontsize=15)
plt.xlabel='Gender'
plt.ylabel='Missed Appointments'
plt.box(False)
plt.grid(True)
plt.legend()
plt.show()
Plot pie chart showing distribution of missed appointments by gender
fig = plt.figure(figsize =(10, 8),tight_layout=False)
plt.pie(df.Gender[Missed].value_counts(), labels =('Female','Male'), colors=('teal','purple'),autopct='%1.1f%%');
plt.title('Missed Appointments %')
plt.show()
The numbr of missed appointments by male patients: 7723
Females comprise of 65.4% of missed appointments
df.Scholarship.value_counts()
print("There are {} appointments of patients enrolled in the welfare program while the rest {} appointments are of patients not enrolled"
.format(df.Scholarship.value_counts().values[1],
df.Scholarship.value_counts().values[0]))
Plot bar graph to show welfare distribution among appointments
plt.figure(figsize=(10,7))
plt.bar(x=(df.Scholarship.value_counts().index[0]),height=df.Scholarship.value_counts().values[0], color='cyan', label='Non-Recipient')
plt.bar(x=(df.Scholarship.value_counts().index[1]),height=df.Scholarship.value_counts().values[1], color='plum', label='Recipient')
plt.title('No. of Appointments of Patients Receiving Welfare', fontsize=15)
plt.ylabel='Appointments'
ax = plt.gca()
ax.get_xaxis().set_visible(False)
plt.box(False)
plt.grid(True)
plt.legend()
plt.show()
Get the number of patients enrolled in the welfare program
with_welfare = df.Scholarship == 1 #receive welfare
no_welfare = df.Scholarship == 0 #not enrolled in welfare program
len(df.PatientId[with_welfare].unique()), len(df.PatientId[no_welfare].unique()) #apply condition to dataset
patient_by_welfare=np.array([len(df.PatientId[with_welfare].unique()), len(df.PatientId[no_welfare].unique())], dtype='int64')
patient_by_welfare #contains distribution of welfare among patients
Plot a pie chart to show the percentages of this distribution
fig = plt.figure(figsize =(10, 8),tight_layout=False)
plt.pie(patient_by_welfare, labels =('Recipient','Non-recipient'), colors=('indigo','pink'),autopct='%1.1f%%');
plt.title('Welfare Enrollment %')
plt.legend()
plt.show()
Check the relationship between enrolment in the welfare program and pre-existing conditions
df.Pre_existing_conditions[with_welfare].value_counts() # 0 for false, 1 for true
df.Pre_existing_conditions[no_welfare].value_counts() # 0 for false, 1 for true
fig, ax = plt.subplots(2,1)
fig = plt.figure(figsize =(20, 20),tight_layout=False)
ax[0].pie(df.Pre_existing_conditions[with_welfare].value_counts(),
labels =('No pre-existing conditions','Have Pre-existing conditions'),
colors=('darkorange','antiquewhite'),autopct='%1.1f%%')
ax[0].set_title('Pre-existing conditions among welfare recipients %', fontsize=15);
ax[1].pie(df.Pre_existing_conditions[no_welfare].value_counts(),
labels =('No pre-existing conditions','Have Pre-existing conditions'),
colors=('darkorange','antiquewhite'),
autopct='%1.1f%%');
ax[1].set_title('Pre-existing conditions among non-recipients %', fontsize=15);
Investigate Appointment Attendance Among Welfare Recipients
df.Attendance[with_welfare].value_counts() #attendance of patients receiving welfare
Plot a bar graph to show the distribution
plt.figure(figsize=(25,10))
plt.bar(x=(df.Attendance[with_welfare].value_counts().index),height=df.Attendance[with_welfare].value_counts(), color='teal')
plt.title('Attendance Patients Receiving Welfare', fontsize=20)
plt.xlabel='Gender'
plt.ylabel='Appointments'
plt.box(False)
plt.grid(True)
plt.legend()
plt.show()
fig = plt.figure(figsize =(10, 8),tight_layout=False)
plt.pie(df.Attendance[with_welfare].value_counts(), labels =('Attended','Missed'), colors=('teal','lightskyblue'),autopct='%1.1f%%');
plt.title('Attendance Patients Receiving Welfare %')
plt.legend()
plt.show()
fig = plt.figure(figsize =(10, 8),tight_layout=False)
plt.pie(df.Attendance[no_welfare].value_counts(), labels =('Attended','Missed'), colors=('teal','lightskyblue'),autopct='%1.1f%%');
plt.title('Attendance Patients Not Receiving Welfare %')
plt.legend()
plt.show()
Check the differences in attendance along age between men and women enrolled in welfare
by_age_attended_welfare = df.query('(Attendance == "Attended") & (Scholarship == 1)').groupby('Age')['PatientId'].count()
by_age_attended_welfare_fem = df.query('(Attendance == "Attended") & (Scholarship == 1) & (Gender == "F")').groupby('Age')['PatientId'].count()
by_age_attended_welfare_male = df.query('(Attendance == "Attended") & (Scholarship == 1) & (Gender == "M")').groupby('Age')['PatientId'].count()
by_age_attended_welfare.plot(label="Total")
by_age_attended_welfare_fem.plot(label="Female")
by_age_attended_welfare_male.plot(label="Male")
plt.title('Attendance count vs Age (Welfare Recipients)')
plt.legend()
plt.show();
-There are more male patients between 0 and 20, their number declines afterwards with a sharp rise and fall at around 60.
(df.query('(Scholarship == 1) & (Gender == "F")').groupby('PatientId')['AppointmentID'].count())
(df.query('(Scholarship == 1) & (Gender == "M")').groupby('PatientId')['AppointmentID'].count())
Check the distribution of scheduling of appointments
df['App_dayname'].value_counts() #distribution of appointments by weekday`
df['App_dayname'].value_counts().plot(kind="barh",
ylabel='Days of week',
xlabel='No. of appointments',
title='Distribution of Appointments',
color='teal',
figsize=(8,6)
)
Attended = df.Attendance == 'Attended'
Missed = df.Attendance == 'Missed'
df.App_dayname[Attended].value_counts()
df.App_dayname[Missed].value_counts()
df.App_dayname[Attended].value_counts().sort_values()
plt.rcParams['figure.figsize'] = [25, 15]
plt.bar(x=(df.App_dayname[Attended].value_counts().index),
height=df.App_dayname[Attended].value_counts(),
label='Attended', color='#24615D')
plt.bar(x=(df.App_dayname[Missed].value_counts().index),
height=df.App_dayname[Missed].value_counts(),
label='Missed', color='#00FFEF')
df['App_dayname'].value_counts()
plt.title('Appointment Attendance')
plt.box(False)
plt.grid(True)
plt.legend()
plt.show()
Check for any relationship between attendance and scheduling
df['Schdl_hour'].value_counts().plot.bar(figsize=(10,8),
color='g',
xlabel='Hours of the Day',
ylabel=('Scheduling of Appointments'),
title=('Scheduling of Appointments vs Time'));
Check Attendance when compared to registration/scheduling of appointments
by_percentange = (df['Part_of_day'][Attended].value_counts() / df['Part_of_day'].value_counts()) * 100 #attendance as a percentage by scheduling time
by_percentange
by_percentange.plot(kind="barh",
title="Percentage Attendance");
df['Part_of_day'][Attended].value_counts()
df['Part_of_day'][Attended].value_counts().plot.bar(figsize=(8,6),
color='g',
xlabel='Part of Day',
ylabel=('Attended Appointments'),
title=('Attended Appointments vs Scheduling:Part of day'));
df['Part_of_day'][Missed].value_counts() #missed appointments categorized by scheduling
df['Part_of_day'][Attended].value_counts().plot.bar(figsize=(8,6),
color='g',
xlabel='Scheduling:Part of Day',
ylabel=('Missed Appointments'),
title=('Missed Appointments vs Scheduling'));
plt.figure(figsize=(8,6))
plt.hist((df.waiting_days[Attended]), label='Attended', alpha=0.8)
plt.hist((df.waiting_days[Missed]), label='Missed', alpha=0.7)
plt.grid(True)
plt.box(False)
plt.title('Waiting Days Before Appointment')
plt.xlabel="No. of days"
plt.ylabel="No. of Appointments"
plt.legend()
plt.show()
df.waiting_days[Attended].unique() #display range of missing days
Check mean for waiting days categorized by attendance
np.mean(df.waiting_days),np.mean(df.waiting_days[Attended]),np.mean(df.waiting_days[Missed])
fig = plt.figure(figsize =(10, 8),tight_layout=False)
y=np.array([9,16])
category = ('Attended','Missed')
explosion = [0.2, 0]
col = ('mediumspringgreen','plum')
plt.title('Mean of Waiting Days Before Appointment')
plt.pie(y, labels = category,explode=explosion, colors=col,shadow = True, startangle = 90)
plt.show()
print("The mean for all waiting days is {},\nthe mean for waiting days for attended appointments is {}\nand the mean for waiting days of missed appointments is {}"
.format((np.mean(df.waiting_days).round(0)),
np.mean(df.waiting_days[Attended]).round(0),
np.mean(df.waiting_days[Missed]).round(0)))
Female patients missed more appointments.
Patients not enrolled in welfare attended more appointments than those enrolled in the welfare programs. Welfare does not have an impact on attendance.
Saturday has the least number of appointments.
--Wednesday
-- Tuesday
-- Monday
-- Friday
-- Thursday
--Saturday
Appointments with shorter number of waiting days have a higher attendance.
LIMITATIONS
Tip: Before you submit your project, you need to create a .html or .pdf version of this notebook in the workspace here. To do that, run the code cell below. If it worked correctly, you should get a return code of 0, and you should see the generated .html file in the workspace directory (click on the orange Jupyter icon in the upper left).
Tip: Alternatively, you can download this report as .html via the File > Download as submenu, and then manually upload it into the workspace directory by clicking on the orange Jupyter icon in the upper left, then using the Upload button.
Tip: Once you've done this, you can submit your project by clicking on the "Submit Project" button in the lower right here. This will create and submit a zip file with this .ipynb doc and the .html or .pdf version you created. Congratulations!
from subprocess import call
call(['python', '-m', 'nbconvert', 'Investigate_a_Dataset.ipynb'])